library(ggplot2)
library(dplyr)
library(rmarkdown)
library(knitr)
library(GEOquery)
library(SummarizedExperiment)
library(DESeq2)
library(here)
library(limma)
library(nlme)
library(MASS)
library(multcomp)
library(pheatmap)
library(variancePartition)
library(lme4)
library("r2glmm")
library(mclust)
library(gprofiler2)
library(enrichplot)

Introduction

Data collection

Data for the study on trivalent influenza vaccine are deposited in National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) with accession GSE48024. It was published in 2013.

A total of 119 male and 128 female between 19 to 41 years of age were recruited in the study. Whole blood samples were drawn before vaccination (day 0) and at three time points after vaccination (day 1, 3 and 14). All male samples are sequenced using Illumina HumanHT-12 V3.0 expression beadchip and all female samples are sequence using Illumina HumanHT-12 V4.0 expression beadchip.

Load data

Two separate datasets are loaded, one for female cohort and another one for male cohort.

gset <- getGEO("GSE48024", GSEMatrix =TRUE, AnnotGPL=TRUE)
## Found 2 file(s)
## GSE48024-GPL10558_series_matrix.txt.gz
## GSE48024-GPL6947_series_matrix.txt.gz
gset_female <- gset[[1]]
gset_male <- gset[[2]]

# Check gender 
unique(gset_female$`gender:ch1`) 
## [1] "female"
# Check gender 
unique(gset_male$`gender:ch1`)
## [1] "male"
# Check sequencing platform 
unique(gset_female$platform_id) #Illumina HumanHT-12 V4.0 expression beadchip
## [1] "GPL10558"
# Check sequencing platform 
unique(gset_male$platform_id) #Illumina HumanHT-12 V3.0 expression beadchip
## [1] "GPL6947"

Based on the data loaded directly from GEO, 116 out of 119 males and 110 out of 128 females are available. A total of 417 samples are from female cohort and 431 samples are from male cohort.

# Check number of subjects available in data
length(unique(gset_female$`subject:ch1`))
## [1] 110
# Check number of subjects available in data
length(unique(gset_male$`subject:ch1`))
## [1] 116
# Check number of samples available in data
ncol(gset_female)
## Samples 
##     417
# Check number of samples available in data
ncol(gset_male)
## Samples 
##     431

There are 47276 probes used for female cohort, and 48742 probes used for male cohort.

# Check number of probes
nrow(gset_female)
## Features 
##    47276
# Check number of probes
nrow(gset_male)
## Features 
##    48742

Data has been processed and normalized according to description given in GEO. We randomly select 10 samples from each cohort to check the distribution of normalized data.

# Data processing + normalization 
unique(gset_male$data_processing)
## [1] "Raw data underwent background adjustment, variance stabilization transformation, and robust spline normalization using R package lumi (available from Bioconductor) and function lumiExpresso()"
# Data processing + normalization 
unique(gset_female$data_processing)
## [1] "Raw data underwent background adjustment, variance stabilization transformation, and robust spline normalization using R package lumi (available from Bioconductor) and function lumiExpresso()"
boxplot(exprs(gset_male)[,sample(1:431, 10, replace=F)], 
        main = "Ten random samples from male cohort", ylab = "Expression")

boxplot(exprs(gset_female)[,sample(1:417, 10, replace=F)], 
        main = "Ten random samples from female cohort", ylab = "Expression")

Last, we check how many samples at each sampling time for each cohort.

# For female
table(gset_female$`time:ch1`)
## 
##  Day0  Day1 Day14  Day3 
##   107   107    98   105
# For male
table(gset_male$`time:ch1`)
## 
##  Day0  Day1 Day14  Day3 
##   111   110   109   101

Because female and male cohorts are sequenced using different platforms, all further data processing and analysis will be done separately.

Analysis plan

Pre-processing

First to setup SummarizedExperiment object.

SE_male  <- SummarizedExperiment(assay = list("exprs" = exprs(gset_male)),
                                  colData = as(gset_male@phenoData, "data.frame"),
                                  rowData = as(gset_male@featureData, "data.frame"))

SE_female  <- SummarizedExperiment(assay = list("exprs" = exprs(gset_female)),
                                  colData = as(gset_female@phenoData, "data.frame"),
                                  rowData = as(gset_female@featureData, "data.frame"))

# Create few variables with better names 
SE_male$time <- factor(SE_male$`time:ch1`, levels = c("Day0", "Day1", "Day3", "Day14"))
SE_male$ID <- SE_male$`subject:ch1`
SE_female$time <- factor(SE_female$`time:ch1`, levels = c("Day0", "Day1", "Day3", "Day14"))
SE_female$ID <- SE_female$`subject:ch1`

Filter samples

As we have seen in Data Collection section, not all participants are sampled at all four time points. Thus, we keep those who sampled at all time.

male.all <- as.data.frame(table(SE_male$ID)) %>% filter(Freq == 4) %>% rename( "Var1" = "ID")
female.all <- as.data.frame(table(SE_female$ID)) %>% filter(Freq == 4) %>% rename( "Var1" = "ID")

SE_male <- SE_male[, which(SE_male$ID %in% male.all$ID)]
SE_female <- SE_female[, which(SE_female$ID %in% female.all$ID)]

There are 92 males and 87 females sampled at all four time points. These will be used for downstream analysis.

table(SE_male$time)
## 
##  Day0  Day1  Day3 Day14 
##    92    92    92    92
table(SE_female$time)
## 
##  Day0  Day1  Day3 Day14 
##    87    87    87    87

Filter probes/genes

Step 1: We first subset to probes with corresponding gene symbols.

Probes for male cohort: 48742 -> 29228

Probes for female cohort: 47276 -> 31243

gset_male <- gset_male[which(gset_male@featureData@data[["Gene symbol"]] != ""), ]
nrow(gset_male)
## Features 
##    29228
gset_female <- gset_female[which(gset_female@featureData@data[["Gene symbol"]] != ""), ]
nrow(gset_female)
## Features 
##    31243

Step 2: Average expression for probes correponding to the same gene name.

Using limma() package.

Probes for male cohort: 29228 -> 19590

Probes for female cohort: 31243 -> 20751

dupRM_male <-  avereps(assays(SE_male)$exprs, ID=rowData(SE_male)$"Gene symbol")
nrow(dupRM_male)
## [1] 19590
dupRM_female <-  avereps(assays(SE_female)$exprs, ID=rowData(SE_female)$"Gene symbol")
nrow(dupRM_female)
## [1] 20751

Update SummarizeExperiment object with updated expression matrix.

# Make sure samples are ordered the same
all.equal(colnames(dupRM_male) , rownames(SE_male@colData))
## [1] TRUE
SE_male_dupRM <- SummarizedExperiment(assay = list("exprs" = dupRM_male),
                                  colData = SE_male@colData)

all.equal(colnames(dupRM_female) , rownames(SE_female@colData))
## [1] TRUE
SE_female_dupRM <- SummarizedExperiment(assay = list("exprs" = dupRM_female),
                                  colData = SE_female@colData)

Step 3: Keep top 75% most variance probes/genes for efficient downstream analysis.

Probes for male cohort: 19590 -> 14692

Probes for female cohort: 20751 -> 15563

rowData(SE_male_dupRM)$gene_variance <- rowVars(assays(SE_male_dupRM)$exprs)
quantile(rowData(SE_male_dupRM)$gene_variance)
##           0%          25%          50%          75%         100% 
## 5.479204e-05 2.238396e-03 3.938472e-03 1.344815e-02 2.601662e+00
SE_male_dupRM_25up <- SE_male_dupRM[which(rowData(SE_male_dupRM)$gene_variance > 0.0022386139), ]

rowData(SE_female_dupRM)$gene_variance <- rowVars(assays(SE_female_dupRM)$exprs)
quantile(rowData(SE_female_dupRM)$gene_variance)
##           0%          25%          50%          75%         100% 
## 0.0002699308 0.0070110061 0.0134392322 0.0334073424 4.5899802389
SE_female_dupRM_25up <- SE_female_dupRM[which(rowData(SE_female_dupRM)$gene_variance > 0.0070110061), ]

For downstream analysis, we use 14692 unique genes and 368 samples from the male cohort, and 15563 genes with 348 samples from the female cohort.

SE_male_dupRM_25up
## class: SummarizedExperiment 
## dim: 14692 368 
## metadata(0):
## assays(1): exprs
## rownames(14692): EEF1A1 GAPDH ... MCM10 ZNF703
## rowData names(1): gene_variance
## colnames(368): GSM1165057 GSM1165058 ... GSM1165486 GSM1165487
## colData names(43): title geo_accession ... time ID
SE_female_dupRM_25up
## class: SummarizedExperiment 
## dim: 15563 348 
## metadata(0):
## assays(1): exprs
## rownames(15563): EEF1A1 GAPDH ... MIR320C1 LINC00173
## rowData names(1): gene_variance
## colnames(348): GSM1165512 GSM1165513 ... GSM1165927 GSM1165928
## colData names(43): title geo_accession ... time ID

GMM clustering to infer response to vaccination

Data on antibody titer are deposited on dbGaP with restricted access. Thus, we use 11 differentiallyt expressed genes with respect to response given by authors to differentiate high response and low response after vaccination. These 11 genes are found using male cohort.

Gene PRDX2 are not present in both female and male cohort after data pre-processing.

DE11_response <- c("STAT1", "IRF9", "SPI1", "CD74", "HLA-E", "TNFSF13B", "PRDX2", "PRDX3", "E2F2", "PTEN", "ITGB1")

DE11_response[!DE11_response %in% rownames(SE_male_dupRM_25up)]
## [1] "PRDX2"
DE11_response[!DE11_response %in% rownames(SE_female_dupRM_25up)]
## [1] "PRDX2"

We will cluster participants based on difference between day1 and day0 expression.

On male cohort

male_exprs_day1_0 <- list( day1 = SE_male_dupRM_25up[which(rownames(SE_male_dupRM_25up) %in% DE11_response),which(SE_male_dupRM_25up$time == "Day1")],
                           day0 = SE_male_dupRM_25up[which(rownames(SE_male_dupRM_25up) %in% DE11_response),which(SE_male_dupRM_25up$time == "Day0")])
colnames(male_exprs_day1_0[["day1"]]) <- male_exprs_day1_0[["day1"]]$ID
colnames(male_exprs_day1_0[["day0"]]) <- male_exprs_day1_0[["day0"]]$ID

male_exprs_day1_0[["day1"]] <- as.matrix(assay(male_exprs_day1_0[["day1"]]))
male_exprs_day1_0[["day0"]] <- as.matrix(assay(male_exprs_day1_0[["day0"]]))

dim(male_exprs_day1_0[[1]])
## [1] 10 92
dim(male_exprs_day1_0[[2]])
## [1] 10 92
# Match the order of gene and order of participants 
male_exprs_day1_0[[2]] <- male_exprs_day1_0[[2]][match(rownames(male_exprs_day1_0[[1]]), rownames(male_exprs_day1_0[[2]])), match(colnames(male_exprs_day1_0[[1]]), colnames(male_exprs_day1_0[[2]]))]

# Subtract day 0 from day 1 for each participants
male_diff_day1_0 <- (male_exprs_day1_0[["day1"]] - male_exprs_day1_0[["day0"]])

male_diff_day1_0[,1:5]
##               IN0123      IN0126      IN0127      IN0128      IN0129
## STAT1     1.31173507  1.52068176  1.02966207  1.22048660  1.33060734
## SPI1      0.12843323 -0.08736426 -0.49973666  0.07800277  0.56261461
## PTEN     -0.26807272  0.31520774 -0.45989548  0.23892477  0.35092521
## ITGB1    -0.03593713  0.03315031 -0.28064441 -0.18637363 -0.09974113
## CD74      0.53791679 -0.11917102  0.30187478  0.32976888  0.45044398
## IRF9      0.05764101  0.91409226  0.59053525  0.74820827  0.52649910
## TNFSF13B -0.18962517  0.53251828  0.94899337  0.36595157  0.41775762
## HLA-E    -0.07087181  0.35358275  0.50044553  1.16363470  0.05931064
## E2F2      0.34867212  0.58911630 -0.04331496 -0.01942406  0.29831446
## PRDX3    -0.03653738 -0.03568729 -0.42529887 -0.09805881 -0.14152018
# GMM cluster
GMM_male <-  Mclust(t(male_diff_day1_0),  G = 2)
table(GMM_male$classification)
## 
##  1  2 
## 27 65

According to authors, genes (“STAT1”, “IRF9”, “SPI1”, “CD74”, “HLA-E”, “TNFSF13B”) are up-regulated in high responders, and remaining genes in the list are up-regulated in low responders.

We assign group 2 as high responders based on boxplots on the difference of expression between day 1 and day 0 by clusters from GMM.

# Check GMM cluster result wrt 10 genes
par(mfrow=c(2,5))
for (i in 1:nrow(male_diff_day1_0)){
  dff <- data.frame(expr = male_diff_day1_0[rownames(male_diff_day1_0)[i],], group=GMM_male$classification)
  boxplot(expr ~ group,dff, main=rownames(male_diff_day1_0)[i], ylab="diff(day1- day0)")
}

SE_male_dupRM_25up$response <- factor(GMM_male$classification[match(SE_male_dupRM_25up$ID, names(GMM_male$classification))],  levels = c(1,2))

On female cohort

Please refer to the Rmd file for code.

GMM result.

## 
##  1  2 
## 48 39

The 10 genes used for clustering are found using male cohort, and analysis using GMM shows inconsistencies when applied on female cohort. Overall, we assign group 1 as high responders and group 2 as low responders for female cohort.

Results

We first define few functions used for analysis.

mixed_test <- function(gene,gene_name, covariates, fml, var_rowname, j){
  df_test <- cbind(gene , covariates)
  colnames(df_test)[1] <- "gene"
  mix_model <- aov(fml, data=df_test)
  df_ret <- data.frame(unlist(summary(mix_model))[paste0("Error: Within.Pr(>F)",j)])
  rownames(df_ret) <- var_rowname
  colnames(df_ret) <- gene_name
  return(tibble(df_ret))
}


GO_analysis <- function(genes){
  DE <- gost(query = genes, organism = "hsapiens", multi_query = FALSE,
                     correction_method = "fdr", user_threshold = 0.01, source = "GO:BP")

  print(table(DE[["result"]][["source"]]))

  DE_mod = DE$result[,c("query",  "term_id",
                                "term_name", "p_value", "query_size", 
                                "intersection_size", "term_size", 
                                "effective_domain_size", "intersection_size", "source")]

  DE_mod$"GeneRatio" = paste0(DE_mod$intersection_size,  "/", DE_mod$query_size)
  DE_mod$BgRatio = paste0(DE_mod$term_size, "/", DE_mod$effective_domain_size)
  DE_mod$"pvalue" <- DE_mod$p_value
  DE_mod$Count <- DE_mod$intersection_size
  DE_mod$Description<- DE_mod$term_name
  DE_mod_enrich <-  new("enrichResult", result = DE_mod)

  dotplot(DE_mod_enrich , x = 'GeneRatio', color = "pvalue") + ggtitle("GO: Biological process")
}

Male cohort

Analysis with respect to time

We first fit mixed effect model, then extract p-value from testing the effect on time variable which is treated as categorical variable.

cov_male <- data.frame(ID = SE_male_dupRM_25up$ID,
                  time = SE_male_dupRM_25up$time,
                  response = SE_male_dupRM_25up$response)

genes_male <- rownames(SE_male_dupRM_25up)
mix_test_time_male <- sapply( 1: nrow(SE_male_dupRM_25up),
                            function(i) mixed_test( gene = as.vector(SE_male_dupRM_25up@assays@data@listData[["exprs"]][i,]),
                                                         gene_name = genes_male[i],
                                                         covariates = cov_male,
                                                         fml = as.formula(gene ~ time + Error(ID)),
                                                         var_rowname = c("time"),
                                                         j = 1))
mix_test_time_male <- as.data.frame(unlist(mix_test_time_male))  
colnames(mix_test_time_male) <- c("pvalue")

hist(mix_test_time_male$pvalue, main ="P-value from fix effect of time", xlab="P-value")

P-values are adjusted using Benjamini-Hotchberg method.

# BH adjusted p-value
mix_test_time_male$BH_pvalue <- p.adjust(mix_test_time_male$pvalue, method =  "BH", n = length(mix_test_time_male$pvalue))

hist(mix_test_time_male$BH_pvalue, main ="BH adjusted p-value from fix effect of time", xlab="BH adjusted p-value")

Controlling for 1% of false discovery rate, there are 4540 differentially expressed (DE) genes across time in the male cohort.

DEtime_male <- mix_test_time_male %>% as.data.frame() %>% filter(BH_pvalue < 0.01)

nrow(DEtime_male)
## [1] 4540

Next, we plot heatmap on top 50 DE genes by BH adjusted p-value. For male cohort, the top 50 DE genes are predominantly up-regulated at day 1 after vaccination. This is consistent with what authors reported in their paper.

DE50_male <- SE_male_dupRM_25up[which(rownames(SE_male_dupRM_25up) %in% rownames(DEtime_male %>% arrange(BH_pvalue))[1:50]), ]

colData(DE50_male)$time <- factor(colData(DE50_male)$time, levels = c("Day0", "Day1", "Day3", "Day14"))

DE50_male_mean <- sapply(c("Day0", "Day1", "Day3", "Day14"), function(day) rowMeans(assay(DE50_male)[, which(colData(DE50_male)$time == day)]))

pheatmap(DE50_male_mean, cluster_rows = TRUE,  cluster_cols = FALSE, scale = "row", 
          main = "Mean expression of top 50 DE genes wrt time by BH adjusted p-values")

Next, we perform Gene Ontology enrichment analysis on the 4540 DE genes. Specifically, we are interested in biological process.

There are 1550 GO terms showed significant association with 4540 DE genes found with respect to time. Some of the top GO term related to immune response and stress are what we expected after vaccination.

GO_analysis(rownames(DEtime_male))
## 
## GO:BP 
##  1550

Analysis with respect to response across time

Variable response is inferred using GMM in previous section.

mix_test_response_time_male <- sapply( 1: nrow(SE_male_dupRM_25up),
                            function(i) mixed_test( gene = as.vector(SE_male_dupRM_25up@assays@data@listData[["exprs"]][i,]),
                                                         gene_name = genes_male[i],
                                                         covariates = cov_male,
                                                         fml = as.formula(gene ~ time*response + Error(ID)),
                                                         var_rowname = c("time"),
                                                         j = 2))

mix_test_response_time_male <- as.data.frame(unlist(mix_test_response_time_male))

colnames(mix_test_response_time_male) <- c("pvalue")

hist(mix_test_response_time_male$pvalue, main ="P-value from fix effect of time*response", xlab="P-value")

# BH adjusted p-value
mix_test_response_time_male$BH_pvalue <- p.adjust(mix_test_response_time_male$pvalue, method =  "BH", n = length(mix_test_response_time_male$pvalue))

hist(mix_test_response_time_male$BH_pvalue, main ="BH adjusted p-value from fix effect of time*response", xlab="BH adjusted p-value")

Controlling for 5% of false discovery rate, there are 573 differentially expressed (DE) genes across time in the male cohort.

DEresponse_male <- mix_test_response_time_male %>% as.data.frame() %>% filter(BH_pvalue < 0.05)

nrow(DEresponse_male)
## [1] 573

We plot heatmap on top 50 DE genes by BH adjusted p-value. For male cohort, the top 50 DE genes are predominantly up-regulated at day 1 among high responders after vaccination. There are few genes that are down-regulated at day 1among high responders after vaccination.

DE50response_male <- SE_male_dupRM_25up[which(rownames(SE_male_dupRM_25up) %in% rownames(DEresponse_male %>% arrange(BH_pvalue))[1:50]), ]

colData(DE50response_male)$time <- factor(colData(DE50response_male)$time, levels = c("Day0", "Day1", "Day3", "Day14"))

DE_response1_mean_male <- sapply(c( "Day0", "Day1", "Day3", "Day14"), function(day) rowMeans(assay(
DE50response_male)[, which(colData(
DE50response_male)$time == day & DE50response_male$response == 1)]))
colnames(DE_response1_mean_male) <- paste0(colnames(DE_response1_mean_male), "_low")

DE_response2_mean_male <- sapply(c( "Day0", "Day1", "Day3", "Day14"), function(day) rowMeans(assay(
DE50response_male)[, which(colData(
DE50response_male)$time == day & DE50response_male$response == 2)]))
colnames(DE_response2_mean_male) <- paste0(colnames(DE_response2_mean_male), "_high")

Group_male <- data.frame( Group = c(rep("Low responder",4), rep("High responder", 4)))
rownames(Group_male) <- c(colnames(DE_response1_mean_male), colnames(DE_response2_mean_male))

pheatmap(cbind(DE_response1_mean_male,DE_response2_mean_male), 
         cluster_rows = TRUE,  cluster_cols = FALSE, scale = "row", 
         main = "Mean expression on top 50 DE genes wrt response over time", 
         annotation_col = Group_male)

There are 483 GO terms showed significant association with 573 DE genes found with respect to response over time. The top GO terms are mostly related to immune response.

GO_analysis(rownames(DEresponse_male))
## 
## GO:BP 
##   483

Female cohort

Please refer to Rmd file for code.

Analysis with respect to time

We fit mixed effect model, and extract p-value on testing the fixed effect of interaction between response and time. P-values are further adusted using Benjamini-Hotchberg method.

Controlling for 1% of false discovery rate, there are 6118 differentially expressed (DE) genes across time in the male cohort.

## [1] 6118

We plot heatmap on top 50 DE genes by BH adjusted p-value. For female cohort, the top 50 DE genes are all up-regulated at day 1 after vaccination.

There are 1682 GO terms showed significant association with 6118 DE genes found with respect to time. Some of the top GO term related to immune response and stress are what we expected after vaccination.

## 
## GO:BP 
##  1682

Analysis with respect to response across time

Response variable is inferred using GMM in previous section. We fit a mixed effect model including interaction between response and time as fixed effect. Then p-values on testing the effect of the interaction is extracted, then adjusted using Benjamini-Hotchberg method.

Controlling for 5% of false discovery rate, there are 3217 differentially expressed (DE) genes across time in the male cohort.

## [1] 3217

We plot heatmap on top 50 DE genes by BH adjusted p-value. Among the top 50 DE genes, majority are down-regulated at all time points among high responders after vaccination. This is different than the result shown for male analysis.

There are 730 GO terms showed significant association with 3217 DE genes found with respect to response over time. The top GO terms are different as compared to the analysis on male cohort.

## 
## GO:BP 
##   730

Conclusion and discussion